One X Variable

Module 3.1: Correlation

Alex Cardazzi

Old Dominion University

All materials can be found at alexcardazzi.github.io.

Economic Relationships

Thus far, we have mostly described single variables. We have examined a variable’s distribution (central tendency, dispersion), made claims about the population it originates from (confidence intervals), and tested hypotheses we’ve developed. However, we are not usually interested in single variables by themselves. More often, economists are interested in relationships between two (or more) variables.

When two variables are in fact related, learning about one of them reveals something about the other one. For example, consider SAT and ACT scores. If you found out what your friend scored on the SAT, you could probably guess their ACT score within a few points.

The point of this module is to learn how to quantify and exploit these types of relationships.

Economic Models

Economists often create simplified theoretical models of the world to study how it works. A famous example of an economic model is the law of demand.

The quantity demanded of a product or service is a function of, or determined by, its price. As price increases, quantity demanded will decrease.1

Of course, there are a lot of factors that impact the quantity demanded of a product or service. However, creating models that account for every little factor would be paralyzing. So, economists focus on what they believe to be the most important factors.

What are some general characteristics of economic models?

Economic models identify the causal forces most relevant to a decision-maker’s behavior, which makes these models predictive. In the case of demand, price causes quantity demanded.

Economic models are intended to characterize important determinants of average behavior, not capture every little nuance.

Economic models generally describe direction of change, but not necessarily magnitude of change. In the case of demand, increases in price lead to decreases in quantity demanded.

When creating an economic model, one needs to consider two kinds of variables:

  • An exogenous variable is determined outside of the system. In other words, these variables are as good as random. In the case of demand, this would be the price.
  • An endogenous variable is determined inside of the system. These variables can be explained by, or are functions of, the exogenous variables. In the case of demand, this would be quantity demanded.

As a shortcut, you can think of “endogenous” variables as \(Y\) variables and “exogenous” variables as \(X\) variables, insofar as \(X\) causes \(Y\).

Once a theory is developed, it becomes important to test the theory. This is where econometrics comes in.

Before we get too far into the weeds:

  • There are different names for exogenous variables. If you see any of the following terms, they are all referring to exogenous variables: “control variables”, “independent variables”, \(x\)-variables, regressors, covariates, explanatory variables.
  • Generally, endogenous variables will be called \(y\)-variables, outcome variables, or dependent variables.

Before jumping to data, economists come up with mathematical expressions to explore their theories. As a simple example, consider the law of demand. In words, we have two conditions:

  1. Quantity demanded is determined by price.
  2. Quantity demanded decreases when price increases.

To “mathematize” the first condition, we can write \(Q_d\) as a function of \(P\): \(Q_d = f(P)\). The second condition implies something about \(f(P)\). Specifically, it suggests that the first derivative should be negative. Mathematically: \(\frac{\partial Q_d}{\partial P} < 0\).

There are infinitely many functions one could choose from that would satisfy these two conditions. However, to reduce a model’s complexity, economists will often “guess” particular functional forms for \(f()\). Perhaps the simplest function that relates \(Q_d\) and \(P\) in a way that satisfies the conditions is a linear relationship. Below is an example of a linear relationship between quantity demanded and price:

Plot

Plot of the law of demand.
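Concretely, one linear form satisfying both conditions is the following (the coefficients \(a\) and \(b\) are illustrative, with \(a, b > 0\), not taken from the plot):

\[Q_d = a - bP, \qquad \frac{\partial Q_d}{\partial P} = -b < 0\]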

This line satisfies both conditions of our law of demand.3 Price and quantity demanded are linked via the function, and the function has a negative slope.

It is important to note:

“All models are wrong, but some are useful.” — George Box

All models are going to be wrong because they cannot account for every possible scenario. For example, 2020’s pandemic, 2008’s financial crisis, and 2000’s dot-com bubble are all generally unforeseeable circumstances that escaped economic models. However, some models provide valuable insight into how the world operates. We will use math and statistics to help us decide which models are at least useful.

Let’s suppose my theory is that students with higher GPAs in high school will score higher on their SAT. Now, we need a way to test this theory.

Let’s pick out our exogenous and endogenous variables. I contend that SAT scores are a function of, or are determined by, high school GPA. In reality, both of these outcomes are likely determined by things like natural ability, study habits, family income (e.g., access to tutoring), educational environment, etc. However, for now, let’s think of GPA as a proxy for all of these things, with SAT scores following as a direct result of GPA.

Another (and more realistic) way to think about this model is as a tool for guidance counselors to help their students understand what they can expect for their SAT scores given their GPA.

In this dataset, we observe SAT and GPA data for 1000 students (data, documentation) to examine the potential relationship.

As a first step, we should read in the data and take a look at it.4

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
head(df)
Output
  X sex sat_v sat_m sat_sum hs_gpa fy_gpa
1 1   1    65    62     127   3.40   3.18
2 2   2    58    64     122   4.00   3.33
3 3   2    56    60     116   3.75   3.25
4 4   1    42    53      95   3.75   2.42
5 5   1    55    52     107   4.00   2.63
6 6   2    55    56     111   4.00   2.91

Next, we should visualize the relationship between the two variables using R’s plot() function.

Code
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum")
Plot

SAT Percentile Sum vs High School GPA

An Aside

This plot looks pretty messy. The points are covering each other, everything is one solid color, and there’s hardly a discernible pattern. One step you can take is to make the points more transparent.

To do this, we are going to install the scales package with the following: install.packages("scales"). Then, we are going to use library("scales") to load in the package. scales contains a function alpha() that accepts vectors of colors and opacities as arguments.

Code
# install.packages("scales")
# I am skipping the install step because I have already done it.
library("scales")
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("tomato", .2), #.2 is 20%
     xlab = "HS GPA", ylab = "SAT Percentile Sum")
Plot

SAT Percentile Sum vs High School GPA

This is a good first step, as our plot now shows a bit of density. However, there are still so many points (1,000 to be exact) that the overall relationship might be obscured.

Notice how many points stack up in vertical lines. This is because GPA is a relatively discrete measure: there are only 35 unique values5 across 1000 observations. We can take advantage of this by collapsing the data down to these 35 points. To do this, we are going to calculate an average SAT score for each unique GPA value.

One way we can do this is with a loop!

First, we’ll extract the unique GPA values. Next, we’ll create an empty vector where we will put our SAT averages for each GPA value. Finally, we’ll calculate the mean and store it.

Code
# gather unique GPAs
gpaz <- unique(df$hs_gpa)
# Create empty vector
satz <- rep(NA, length(gpaz))
for(gpa in gpaz){
  
  # Take the average of the subsetted SATs.
  temp <- mean(df$sat_sum[df$hs_gpa == gpa])
  # Store "temp" into satz where gpaz == gpa
  satz[gpaz == gpa] <- temp
}
plot(gpaz, satz, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum")
Plot

SAT Percentile Sum vs High School GPA

Seeing this plot, one can start to pick out the relationship between GPA and SAT, even if only by eye. However, having to write a loop to generate these data is unnecessarily tedious. Luckily, R has a function called aggregate() that does exactly what we want with much less typing.

aggregate() accepts a few main arguments:

  • x: The data you want to aggregate/collapse. In our case, sat_sum.
  • by: The level you want to aggregate/collapse to. In our case, hs_gpa.
  • FUN: The function to be applied to x for each unique by. In our case, mean().

Below is an example of how to use aggregate() for the same purpose as the previous loop.

Code
# If you enter elements as lists, you can name them.
# These names will be passed to the data.frame output.
aggregate(x = list(sat = df$sat_sum), # x is a list of what to collapse
          by = list(gpa = df$hs_gpa), # by is a list of what to collapse by
          FUN = mean) -> agg_grades # FUN is the function to use
head(agg_grades) # note: the output is a data.frame, which we are used to.
plot(agg_grades$gpa, agg_grades$sat, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum")
Output
   gpa       sat
1 1.80  87.00000
2 2.00  89.73333
3 2.03  80.00000
4 2.10 103.55556
5 2.20  91.33333
6 2.25  93.00000
Plot

SAT Percentile Sum vs High School GPA

What makes aggregate() special is the ability to aggregate multiple variables and/or by multiple variables. As an example, let’s find the average sat_sum, sat_m, and sat_v by both hs_gpa and sex.6

Code
aggregate(x = list(sat = df$sat_sum, # collapsing multiple variables
                   math = df$sat_m,
                   verbal = df$sat_v),
          by = list(gpa = df$hs_gpa, # collapsing by multiple variables
                    female = ifelse(df$sex == 1, 1, 0)),
          FUN = mean) -> agg_grades
head(agg_grades)
plot(agg_grades$gpa, agg_grades$sat, las = 1, pch = 19,
     col = alpha(ifelse(agg_grades$female == 1, "tomato", "dodgerblue"), .7),
     xlab = "HS GPA", ylab = "SAT Percentile Sum", cex = 1.5)
legend("bottomright", c("Female", "Male"), pch = 19,
       col = c("tomato", "dodgerblue"), bty = "n", cex = 1.5)
Output
   gpa female       sat     math verbal
1 2.00      0  92.66667 51.16667  41.50
2 2.03      0  80.00000 43.00000  37.00
3 2.10      0 104.00000 55.50000  48.50
4 2.20      0  79.00000 43.00000  36.00
5 2.25      0  84.50000 38.50000  46.00
6 2.30      0  91.00000 44.75000  46.25
Plot

SAT Percentile Sum vs High School GPA

With aggregate(), you can also use other functions like length() to calculate the number of observations that fall within each bucket. Or, you can use sd() to calculate standard deviations. If we calculate mean(), length(), and sd(), we can create confidence intervals. The plot below creates confidence intervals for each GPA.

Code
meanz <- aggregate(x = list(sat = df$sat_sum),
                   by = list(gpa = df$hs_gpa),
                   FUN = mean)
sdz <- aggregate(x = list(sat = df$sat_sum),
                 by = list(gpa = df$hs_gpa),
                 FUN = sd)
nz <- aggregate(x = list(sat = df$sat_sum),
                by = list(gpa = df$hs_gpa),
                FUN = length)

# Note, we cannot calculate a standard deviation when n < 2
# We'll have to drop observations where this is the case.
meanz <- meanz[nz$sat > 1,]
sdz <- sdz[nz$sat > 1,]
nz <- nz[nz$sat > 1,]

ci_upper <- meanz$sat + (1.96 * sdz$sat / sqrt(nz$sat))
ci_lower <- meanz$sat - (1.96 * sdz$sat / sqrt(nz$sat))
plot(meanz$gpa, meanz$sat, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum",
     ylim = range(ci_upper, ci_lower))
# segments() is a new function for you, so take a look at what it does.
segments(x0 = meanz$gpa, y0 = ci_lower, y1 = ci_upper)
Plot

SAT Percentile Sum vs High School GPA plus Confidence Intervals

Back to Economic Models

Now that we can see a relationship visually, we need to figure out how to quantify it. We will start with something called covariance, a measure of how two variables move with one another. Let’s work up to the formula with some intuition.

We are going to compare each data point to its respective mean. Specifically, we will subtract each variable’s mean from its observations. Then, we’ll plot the data and include vertical and horizontal lines at 0.

Code
df$gpa_0 <- df$hs_gpa - mean(df$hs_gpa)
df$sat_0 <- df$sat_sum - mean(df$sat_sum)
plot(df$gpa_0, df$sat_0, las = 1, pch = 19,
     col = alpha("dodgerblue", .2),
     xlab = "GPA - mean GPA", ylab = "SAT - mean SAT")
abline(h = 0, v = 0)
Plot

Mean of GPA subtracted from GPA vs mean of SAT subtracted from SAT.

If we find that \(x_i - \bar{x} > 0\) and \(y_i - \bar{y} > 0\) (or \(x_i - \bar{x} < 0\) and \(y_i - \bar{y} < 0\)) at the same time more often than not, this is good evidence that the two variables are moving together. In words, if GPA is above (below) its mean at the same time that SAT is above (below) its mean, these variables are moving together. See the figure below for an illustration.

Code
plot(df$gpa_0, df$sat_0, las = 1, pch = 19,
     # Color each point by whether its two deviations share a sign.
     # Note that the opacity could also depend on the data; here it is 0.2 for all points.
     col = alpha(ifelse(df$gpa_0 * df$sat_0 > 0, "dodgerblue", "black"), 0.2),
     xlab = "GPA - mean GPA", ylab = "SAT - mean SAT")
abline(h = 0, v = 0)
Plot

Same image as before, except this one is colored based on which quadrant each point falls in.

For each point, we can tell whether both values are positive (both above their means) or both negative (both below their means) by multiplying them together. Two positive numbers, or two negative numbers, multiplied together produce a positive number. If one number is negative and the other is positive, the product will be negative. To simplify this, we can use the sign() function, which outputs +1, 0, or -1 depending on the sign of its input.

Code
sign(c(-10, 0, 3, -4, 6, 100))
head(sign(df$gpa_0) * sign(df$sat_0))
Output
[1] -1  0  1 -1  1  1
[1]  1  1  1 -1  1  1

If the points were exactly evenly distributed across the quadrants, we would expect the average of sign(df$gpa_0) * sign(df$sat_0) to equal 0 (the -1s and +1s would cancel out). However, if the average is positive (negative), this suggests a positive (negative) relationship.

The mean of this outcome is indeed positive (0.304). Using the table() function, we can see that most of the values fall in the (-1, -1) or (+1, +1) groups:
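As a quick check, that average can be computed directly. A minimal sketch (the data are re-read and the demeaned columns recomputed so the snippet stands alone):

```r
# Re-load the data so this snippet runs on its own
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
gpa_dev <- df$hs_gpa - mean(df$hs_gpa)    # same as df$gpa_0 above
sat_dev <- df$sat_sum - mean(df$sat_sum)  # same as df$sat_0 above
mean(sign(gpa_dev) * sign(sat_dev))  # 0.304: positive, so the variables tend to move together
```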

Code
table(GPA = sign(df$gpa_0), SAT = sign(df$sat_0))
Output
    SAT
GPA   -1   1
  -1 318 151
  1  197 334

Instead of using sign(), let’s just use the actual differences. Why is this better? First, there are calculus-related reasons that we won’t discuss. Second, it weights observations farther from the mean more heavily than those close to it.

The formula for this average looks like the following:

\[\frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})\times(y_i - \bar{y})\]

Doesn’t this look familiar? It becomes the formula for variance if we substitute \(y_i\) for \(x_i\).

\[\frac{1}{n}\sum_{i = 1}^n (y_i - \bar{y})\times(y_i - \bar{y}) = \frac{1}{n}\sum_{i = 1}^n (y_i - \bar{y})^2\]

This formula is called covariance.7 A variable’s covariance with itself is equal to its own variance.
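That last claim is easy to sanity-check in R with a small made-up vector:

```r
# A variable's covariance with itself equals its own variance
x <- c(2, 4, 6, 8, 10)
all.equal(cov(x, x), var(x))  # TRUE
```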

Covariance

What is the covariance of these two different variables?

Code
cat("R's default formula:", cov(df$hs_gpa, df$sat_sum), "\n")
cat("Our custom calculation:", sum(df$gpa_0 * df$sat_0) / (nrow(df) - 1))
Output
R's default formula: 3.333749 
Our custom calculation: 3.333749

Unfortunately, there is no intuitive interpretation, in words, of what this number means. Rather, we mostly interpret its sign: positive means the variables move together, and negative means they move in opposite directions.

Let’s see if we can turn this number into something interpretable, starting with the variance formula. Suppose we have data on some distance, and our variance is 36 miles\(^2\). To make this number more interpretable, we would take the square root, which gives a standard deviation of 6 miles. Instead of taking the square root, we could also have divided the variance by the standard deviation, which yields the same answer: \(\sqrt{\sigma^2} = \frac{\sigma^2}{\sigma} = \sigma\). Let’s do the same thing for our GPA and SAT covariance.
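The miles example, as a quick numerical sanity check in R:

```r
v <- 36        # variance, in squared miles
s <- sqrt(v)   # standard deviation: 6 miles
v / s == s     # TRUE: dividing variance by its standard deviation also yields the standard deviation
```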

Currently, we have 3.3337488 as our covariance, and its units are GPA units \(\times\) SAT units. To scale this, we can divide by the standard deviation of either variable. But which variable should we divide by? Why not both?

Going back to variance, if we divide variance (\(\sigma^2\)) by its standard deviation (\(\sigma\)) twice, we would get \(\frac{\sigma^2}{\sigma \times \sigma} = \frac{\sigma^2}{\sigma^2} = 1\). Remember, the covariance of any variable with itself is equal to its variance. There will be no other variable in the world with a stronger relationship to variable \(y\) than \(y\) itself! Therefore, a value close to 1 indicates that two variables are almost identical. On the other hand, imagine the covariance of \(y\) and \(-y\). In this case, when \(y\) is above its mean, \(-y\) must be below its mean. Calculating the covariance of \(y\) and \(-y\) will result in the same magnitude as the variance of \(y\), but the sign will be negative. Therefore, a value close to -1 indicates that two variables are nearly complete opposites.

So, +1 is the maximum value and indicates the strongest positive relationship possible.

Then, -1 is the minimum value and indicates the strongest negative relationship possible.
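Both extremes are easy to verify numerically with any made-up vector \(y\):

```r
y <- c(1, 3, 5, 7, 9)
cor(y, y)              # 1: the strongest possible positive relationship
cor(y, -y)             # -1: the strongest possible negative relationship
cov(y, -y) == -var(y)  # TRUE: same magnitude as the variance, opposite sign
```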

Let’s scale the covariance of hs_gpa and sat_sum similarly to how we scaled the variance of \(y\): divide by the standard deviation of both variables.

Before, we calculated \(\frac{cov(y, y)}{\sigma_y \times \sigma_y}\). Now, we’ll calculate \(\frac{cov(\text{GPA}, \text{SAT})}{\sigma_{\text{GPA}} \times \sigma_{\text{SAT}}}\).

This number is a unitless measure of the strength and direction of the relationship between two variables. It is called the correlation coefficient, denoted by the Greek letter rho (\(\rho\)).

\[ \rho_{x,y} = \frac{cov(x, y)}{\sigma_x \sigma_y}\]
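We can confirm that this ratio matches R’s built-in cor() on the GPA and SAT data (the data are re-read so the snippet stands alone):

```r
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
rho <- cov(df$hs_gpa, df$sat_sum) / (sd(df$hs_gpa) * sd(df$sat_sum))
all.equal(rho, cor(df$hs_gpa, df$sat_sum))  # TRUE
rho  # about 0.431
```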

Correlation

Below is a visualization of the previous scenarios (\(y\) vs \(y\), GPA vs SAT, \(y\) vs \(-y\)):

Code
par(mfrow = c(1, 3))

plot(df$hs_gpa, df$hs_gpa, xlab = "GPA", ylab = "GPA",
     main = c("Correlation:", round(cor(df$hs_gpa, df$hs_gpa), 3)),
     col = alpha("dodgerblue", 0.2), pch = 19)
abline(v = mean(df$hs_gpa), h = mean(df$hs_gpa))

plot(df$hs_gpa, df$sat_sum, xlab = "GPA", ylab = "SAT",
     main = c("Correlation:", round(cor(df$hs_gpa, df$sat_sum), 3)),
     col = alpha("mediumseagreen", 0.2), pch = 19)
abline(v = mean(df$hs_gpa), h = mean(df$sat_sum))

plot(df$hs_gpa, -df$hs_gpa, xlab = "GPA", ylab = "-GPA",
     main = c("Correlation:", round(cor(df$hs_gpa, -df$hs_gpa), 3)),
     col = alpha("tomato", 0.2), pch = 19)
abline(v = mean(df$hs_gpa), h = mean(-df$hs_gpa))

par(mfrow = c(1, 1))
Plot

Three images. From left to right, a perfect positive correlation, a correlation of 0.431, and then a perfect negative correlation.

The word “correlation” is one of the few that has managed to sneak out of statistics textbooks and into ordinary, everyday vocabulary. I’m sure you’ve heard or said “this is correlated with that”, or “correlation does not mean causation”, etc. Correlation is a precisely defined term, and hopefully you now have a deeper understanding of what it really means.

If two variables (\(A\) and \(B\)) are correlated, it is because one of the following is true:

  1. \(A \rightarrow B\) (\(A\) causes \(B\))
  2. \(B \rightarrow A\) (reverse causality)
  3. \(C \rightarrow A\) and \(C \rightarrow B\) (some third factor causes both)
  4. \(A\) and \(B\) are correlated by random chance.

Each time you find a correlation between two variables, you should think deeply about all four of these possible reasons.

In our example about GPAs and SAT scores, it is likely that there is a third factor, or quite a few other factors, that determine both someone’s GPA and SAT score. Therefore, the correlation we identified should not be interpreted as causal.

Take a look at this website of spurious correlations for examples of things that are correlated due to chance.

If we are after causation, what is the best way to rule out all of the ways we could find a correlation between \(A\) and \(B\)? Randomization!

Suppose we wanted to know whether a new sunscreen prevents sunburn. The best way to answer this question would be to take a group of people and randomly split them into two groups: one group gets the sunscreen, and the other group gets lotion.

Randomization would:

  • Remove the possibility of reverse causality (Story 2). In other words, without randomization, maybe people only use sunscreen once they get a sunburn. This would make it look like sunscreen causes sunburns, even though it’s the other way around.
  • Imply that the treatment and control groups differ only by treatment, which rules out the story of a third factor causing the correlation (Story 3). Without randomization, since younger people generally take more risks, they are less likely to use sunscreen. Younger people are also more likely to spend time outside, which leads to more sunburns. We would then find that sunscreen use is associated with less sunburn, but this would be due to age influencing both variables.
  • Ensure any correlation is not driven by statistical chance once the sample size is large enough (Story 4).

Calculating correlations in R can be done with the cor() function. This outputs a correlation coefficient when you enter two vectors. Moreover, you can feed cor() a data.frame, and it will calculate a correlation matrix. The matrix contains the correlation coefficients calculated from every combination of variables.

Code
cor(df$sat_sum, df$hs_gpa)
cor(df[,c("sat_v", "sat_m", "sat_sum", "hs_gpa", "fy_gpa")])
Output
[1] 0.4307883
            sat_v     sat_m   sat_sum    hs_gpa    fy_gpa
sat_v   1.0000000 0.4665806 0.8522618 0.3595001 0.4013904
sat_m   0.4665806 1.0000000 0.8603333 0.3780704 0.3871178
sat_sum 0.8522618 0.8603333 1.0000000 0.4307883 0.4602810
hs_gpa  0.3595001 0.3780704 0.4307883 1.0000000 0.5433535
fy_gpa  0.4013904 0.3871178 0.4602810 0.5433535 1.0000000

Footnotes

  1. This is not to say that increases in quantity demanded can influence price.↩︎

  2. One of the main points of this course is to turn qualitative economic models into quantitative statistical models.↩︎

  3. As a note, 99% of the time, the variable that determines another is placed along the \(x\)-axis. Of course, it then follows that the dependent variable is then placed along the \(y\)-axis. However, Alfred Marshall popularized the convention of price being on the \(y\)-axis even though this is usually thought of as the variable that determines quantity demanded. See stackexchange for a more detailed explanation if interested. This is one of the few times you will see the \(x\) variable on the \(y\) axis.↩︎

  4. Note that the sat_sum variable is the sum of a student’s Math and Verbal percentiles. Therefore, the best student in the world would be in the 100\(^{\text{th}}\) percentile for both, meaning sat_sum would be 200. A student at the median of each would have values of 50 and 50, making sat_sum equal to 100.↩︎

  5. I calculated this via: length(unique(df$hs_gpa))↩︎

  6. I am going to make an assumption about the sex variable. If sex is 1, this denotes a female.↩︎

  7. In both formulas, I am dividing by \(n\) for illustration purposes. However, the correct formula is \(n-1\) as the denominator.↩︎